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Abstract — In this paper an approach to modelling non- 
stationary binary sequences, i.e., predicting the probability of 
upcoming symbols, is presented. After studying the prediction 
model we evaluate its performance in two non-artificial test cases. 
First the model is compared to the Laplace and Krichevsky- 
Trofimov estimators. Secondly a statistical ensemble model for 
compressing Burrows-Wheeler- Transform output is worked out 
and evaluated. A systematic approach to the parameter optimiza- 
tion of an individual model and the ensemble model is stated. 

Index Terms — data compression; sequential prediction; param- 
eter optimization; numerical optimization; combining models; 
mixing; ensemble prediction 

I. Introduction 

A. Background 

Sequential bitwise processing plays a key role in sev- 
eral general-purpose lossless data compression algorithms, 
including Dynamic Markov Coding (DMC) [1|, Context Tree 
Weighting (CTW) [2| and the recently emerging "Pack" (PAQ) 
0, family of compression algorithms. All of these algo- 
rithms belong to the class of statistical data compression algo- 
rithms, which split the compression phase into modelling and 
coding. A statistical model assigns probabilities to upcoming 
symbols and these are translated into corresponding codes. 
Assigning a high probability to the actually upcoming symbol 
leads to a short encoding, thus producing compression. The 
ideal code length corresponding to a prediction can closely 
be approximated via Arithmetic Coding (AC) 0. Hence 
improving prediction accuracy is crucial for compression. 

Recently, PAQ-based compression algorithms have been 
of high public interest, due to the enormous compression 
achieved. Unfortunately, there is little up-to-date literature on 
the internals of the involved algorithms 0, H, @. PAQ 
compression algorithms combine multiple binary predictors 
and are characterized by low processing speed and the best 
compression rates in multiple benchmarks up to date. Ensem- 
ble prediction has previously been applied successfully in other 
areas of research, e.g., time series forecast and classification 
0, 10, 0, and form a promising direction of research. In 
the field of compression an ensemble approach is often called 
Context Mixing (CM). 

The most elementary task of the prediction model is se- 
quential probability assignment, i.e., predicting the probability 
distribution P(jjk+i\ykVk-i ■ ■ -yi) of the upcoming symbol 



yk+i based on the already encountered sequence yiy2 ■ ■ - Vk 
over a finite alphabet S. Such a task typically arises when 
working with context models. A finite number of symbols 
preceding y k +i can be used to condition the probability, 
which leads to finite context modelling [5|. The finite context, 
e.g., the character immediately preceding the current one, 
splits the source sequence into sub-sequences. These are often 
called context histories. For instance the context history of 
the context "e" (underlined) regarding the last sentence is 

"s ndxs", an underscore represents a space symbol. This 

work focuses on binary alphabets, S = {0, 1} and uses the 
convention p k = P(y k = 1). 



B. Previous work 

Experiments have shown that a local adaption of the com- 
puted statistics during modelling typically improves compres- 
sion. Thus more recent observations are of higher importance 
for probability assignment [5|, [10|. This observation was 
made more or less accidentally due to limited calculation 
precision, which lead to a periodic rescaling of character 
counts [5 1. Previous work investigated the effect of scaling ifTTIl 
and pointed out an approximate probability estimation model 
for binary sequences based on exponential smoothing |10|. 
Another aspect is the presence of noise within observations. 
An imperfect choice of conditioning contexts will lead to 
observations within context histories, which deviate from the 
governing probability distribution. We consider such events 
as outliers or simply noise. A recent work lfl2l studied the 
effect of a limited probability interval, i.e., 9 £ [a, f3] C [0, 1] 
along with the estimation of the parameter p k = = const 
regarding a series of independent identically distributed (iid) 
binary random variables. A limited probability interval can be 
explained by viewing an observed sequence as the outcome 
of the transmission of the "true" sequence through a noisy 
channel (i.e., an extension to the original source model). 
Results indicate that having knowledge about the parameters a 
and f3 can lead to significant improvements in compression for 
short to medium sized sequences. Thus using the restriction 
Pk G [ct, f3] can represent a countermeasure for noisy observa- 
tions. 



C. Our contribution 

The previous section explained the aspects of observation 
recency and observation uncertainty. Based on these ideas we 
enhance a standard approach for sequential binary prediction 
and introduce a new prediction model. We further employ our 
prediction model to construct a new ensemble compression 
algorithm. This compression algorithm is intended to be used 
as a second step algorithm in Burrows-Wheeler- Transform 
(BWT) based compression. Both, the sequential prediction 
model and the ensemble compression algorithm, contain con- 
stants (fixed during compression or decompression), which 
influence the probability estimation and the compression. We 
denote such constants as parameters of the algorithm or 
parameters of the prediction model (which should not be 
confused with parameters of a distribution). In the general 
setting there are no simple rules for choosing the (unknown) 
parameters. Among the set of feasible parameters, we want 
to chose the parameters according to a certain objective. 
In data compression this objective is the minimization of 
the size of the compressed output. Most of the parameter 
optimization in the area of data compression was carried out 
using ad-hoc hand-tuning, e.g., Ifl3l p. 4], iTPfl p. 6] and 
lTT5l p. 4]. In this work we want to introduce systematic 
approaches to automated parameter optimization, since these 
will improve the compression performance compared to ad- 
hoc hand-tuning. 

We distinguish two versions of automatic parameter opti- 
mization in compression, which we call offline and online 
optimization. Given a training data set the models' parameters 
can be fitted once and remain static during future usage (offline 
optimization). This approach requires a carefully chosen set 
of training data. Since the optimization takes place only 
once and not prior to every compression pass there are no 
significant restrictions on the amount of data and the associated 
processing time. On the other hand, adding an initial opti- 
mization pass prior to compression and saving the parameters 
along with the compressed data refers to an online approach. 
However, there are more severe restrictions on the utilized 
resources. We consider a situation in which the optimization 
pass requires orders of magnitude more time than the actual 
(de-)compression process impractical for online optimization. 
In this work we focus on online optimization and incorporate 
an automated optimization pass into the ensemble model 
mentioned above. Coupling online optimization and statistical 
compression leads to asymmetric statistical compression, a 
new family of statistical compression algorithms. Without 
optimization such algorithms are typically symmetric, since 
modelling and coding is required during compression and 
decompression. Similar approaches to asymmetric algorithms 
exist in the field of audio compression |[T6l . 

There is another non-obvious benefit in using optimization. 
Assume an algorithm A achieves a certain compression rate 
using an ad-hoc parametrization. A computationally cheaper 
algorithm B produces compression comparable to A along 
with optimized parameters. Thus the compression time is 



reduced when A is replaced by B. This argument holds 
especially for offline optimization: The time required for 
optimization does not need to be included in the compression 
time, since optimization is only carried out once. 

The remaining part of this work is divided into four further 
sections. First we present a new elementary, binary prediction 
model, its application to non-binary alphabets and an approach 
to ensemble prediction. Section|III]briefly summarizes iterative 
numeric optimization and its application to the presented mod- 



elling algorithms. Afterwards Section IV evaluates the model 



components' performance and the impact of optimization. 

II. Modelling 
A. Elementary prediction 

As previously mentioned in Section [I] the most es- 
sential task is to estimate the probability distribution 
p n +i = P(Y n+ i = l\B n = b n ) given the series of binary ran- 
dom variables B n = Y1Y2 . . . Y n and an instance b n = 
2/12/2 • • • Vn € {0, 1}™, where to out of n bits are one. 
Assuming iid random variables Y/., i.e., pt — 8 for all k and 
some fixed 8 £ [0, 1], one can calculate the probability of a 
given outcome b n via 



P(B n = b n \0) = J] P(Y k = y k ) = m (l 



(1) 



fe=i 



When b n is fixed an estimation 8 of 8 can be obtained via 
maximizing P(8\b n ), or via minimizing the entropy 



H{8\b n ) = -J2^gP(Y k =y k ) 



(2) 



k=l 



-to log # + (n — to) log(l — 



Note that logarithms are to the base two. The result of min- 
imizing Q is the well-known maximum likelihood estimator 
6 = m/n. Equation |2} is rewritten to yield 



H(b n ) 



k=l 



(lfe]og0+(l-Ifr)log(l-0)). (3) 



Since we assume that the coding cost of more recent events 
is of higher importance, we modify <|3j to become a weighted 
entropy (cf. ifTTIl ) 



H w (b n ) 



k=l 



c fc (y fc log0 + (l-y fc )log(l-0)), (4) 



where < c\ < C2 < • • • < c„ is some weight sequence. 
In this way, the value of 8 is strongly linked to more recent 
observations (steps n, n — 1, . . . ). 

Next we address the aspect of observation uncertainty, 
similar to fl2l . The observations y k are viewed to be the 
outcome of a binary symmetric channel. On the transmitter 
side the outcome y k of a binary random variable Y k is sent 
through the channel. The receiver observes a corrupted bit 



1 — y k with a probability e, i.e., the outcome of a binary 
random variable X k . Summarizing 



P{X k = y k \Y k = y k ) = l-£, 
P(X k ^ y k \Y k = y k ) = e 



(5) 



holds for some < e < 0.5. Thus Q is modified to become 
the expected, weighted entropy 



H w (b n 



c fc (4 logfl + (1 - 4) log(l - 8)) ,(6) 



fe=i 



(l-e)y k +e(l-y k ), 



since we can only observe the receiver side. We assume that 
the statistical properties of the bit sequence do not change 
rapidly (i.e., p n +i ~ Pn) an d approximate p n using the 
solution of the minimum-entropy problem 



p n+ i «p„ = argmini/ u; (& n ), 



which results in 



Pn+l 



En 
k=l°k 



(7) 



(8) 



= e ■ 



(1-fe) 



En 
fe=l C & 

Thus modelling uncertainty via |5]) restricts the probability 
interval to be [e, 1 — e]. As a side effect the problem of 
assigning a probability to the opposite bit y k +\ = 1 — 6 
when processing a deterministic sequence yi = y-2 = ■ ■ ■ = 
Uk = b € {0, 1} is solved. The source model discussed above 
contains several (generally unknown) parameters - the weight 
sequence and e. In order to use the source model for prediction 
these parameters have to be chosen. A bad choice leads to 
redundancy during coding, e.g., in some step k the actual value 
of p k could be located outside of the restricted probability 
interval, but the model is only able to assign values in [e, 
depending on the estimated parameter i, 

B. Efficient approximations 

Equation ([H} can already be utilized to obtain a probability 
estimation given a weight sequence and e. However, from a 
practical point of view and as a matter of convenience an 
estimation should be calculated incrementally, hence we select 
an exponentially decaying weight sequence 



c fe = A n " fe , l<k<n, 
with Ae (0,1]. Equation ([8) becomes 



Pn+l 



where 



S n +1 
T n +1 



+i 



1. 



(9) 



(10) 



(11) 
(12) 
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Fig. 1. The decomposed symbol S3 is encoded in eight consecutive binary 
steps (current bit is boldface) using an order-1 context (underlined), i.e., the 
predictions P(ys = 1 | S2 = 101), P(y7 = 1 | ys = 1, S2 = 101), 
P(yi = 1 I y 2 J/3 • • • Us = 0110110, S2 = 101) need to be calculated. After 
encoding S3, S4 can be processed in the same fashion. 



which can be reformulated to yield an adjustment proportional 
to the prediction error 



Pn+l = Pn + 7^ (Sn-Pn)- 

J-n+1 



(13) 



Initially we have po = 0.5 and Tq = 0. Note that the sequence 
T n is a geometric series and therefore 

1 

» 1 - A' 



(14) 



For a very long sequence exponential smoothing can be used 
as an approximation of ( fl3] >, i.e, 



Pn+l = Pn + (1 - A)((5„ - p n ), 
= \p n + {l-X)8 n . 



(15) 



Depending on the computational resources different approxi- 
mations seem acceptable: 

• Exact model Mi. An estimator state is (p n ,T n ), com- 
puted according to ( |13) , 

• Exponential smoothing M 2 . The state is given by (p n ) 
and is updated following ( |15) , Selecting 1 — A = 2~ l , 
I E N results in a very efficient calculation using bit shifts 
and additions/subtractions only. 

Note that Mi can be approximated more closely by imposing 
an upper limit on T„, or n, respectively. This yields a state 
(p n ,n'), with n' = min(n,n) for a threshold n. The values 
of 1/T n are found using a lookup table and « is set 
according to ( [14) . All approximations described above share 
the same parameters A and e. 

C. Alphabet decomposition and context modelling 

The previous section dealt with the modelling of a bi- 
nary alphabet. In general the compression algorithms work 
on n-ary alphabets S, typically |E| = 2 8 . Hence bitwise 
processing requires an alphabet decomposition, i.e., a map- 
ping code : S — > {0, 1} + and len : S — > N to indicate 
the code length. Without loss of generality we may assume 
that E = {0, 1, 2, . . . , m — 1}. Within this work we use 
a fixed decomposition, which we call "flat decomposition", 
i.e., code(s) = bin(s) (e.g., code(65) = 01000001) and 
len(s) = L = 8 for every symbol s G S. Modelling 



the probability distribution of s is split into len(s) = L 
consecutive steps 

P{s) = P{y L )P{y L . l \y L )...P{y l \y 2 y 5 ...VL)- (16) 

Working with conditional probabilities increases the prediction 
accuracy. A natural choice are order-iV contexts, which have 
successfully been applied to text compression J4], Q. An 
order- N context consists of the last N characters immediately 
preceding the current one. Figure [T] illustrates the bitwise 
modelling process using an order- 1 context. Depending on the 
underlying data other choices can be reasonable as well, e.g., 
the neighbouring pixels in image compression [17|, |18|. 



D. An ensemble predictor 

Section [I] mentioned the successful application of ensemble 
models in other areas. In the area of compression such 
techniques are known 0, but there has been less interest in 
directly applying them. Such techniques allow multiple models 
to contribute with their advantages without cumulating their 
disadvantages [6|. During modelling a probability must be 
calculated for each alphabet symbol s e E, hence combin- 
ing M models roughly requires M ■ |£| operations. On the 
other hand bitwise processing just requires M ■ L operations 
on average, where L is the average code length. Without 
making further assumptions about symbol frequencies, i.e., 
applying the decomposition described in Section II-C we 
get L = L = [log | £ I"]. An advantage of CM compared 
to Prediction by Partial Matching (PPM) is that it does not 
need to handle symbols, which did not appear in the current 
context, in a special way lfl9l . PPM indicates the presence 
of such a situation using an artificial escape symbol, whose 
probability needs to be modelled in every context. However, 
such situations may add redundancy, since there is code space 
allocated for possibly never appearing symbols. This issue 
can be crucial for PPM GUI . A disadvantage of CM is the 
requirement of multiple models simultaneously, which has 
heavy impact on processing speed and memory requirements. 

We now describe the outline of our approach to ensem- 
ble prediction, or CM respectively. It is based on a source 
switching model 0. Consider a set of M sources and a 
probabilistic switching mechanism, which selects source i with 
a probability of w\ (in step k) where w k = 1- Note 

that the switching model should not be confused with Volf's 
switching method (T5|, which is based on switching between 
source coding algorithms rather than constructing an ensemble 
source model. In its current state x\ the selected source emits a 
one-bit with the probability p\ — P(Yk — ^\x l k ). Afterwards a 
state transition takes place for each source resulting in the next 
state x\ +1 . In an analogous fashion the switching probabilities 
may vary, i.e., these may depend on a state, too. Summarizing 
the probability of a one-bit in step k is 

M 

Thus the assumption (or approximation) of a switching source 
results in a linear ensemble prediction (linear mixing). Unfor- 
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Fig. 2. A typical example of BWT output taken from bookl (Calgary Corpus). 



tunately, normally no information about the internals of the 
source (e.g., involved states and transitions) or the characteris- 
tics of the probability assignment is available. The assignment 
is up to the designer. 

E. Applications and test cases 

Single model: To examine the prediction model described 
in Sections |II-A| and |II-B| we will compare its performance to 
the well-known Laplace (LP)- and Krichevsky-Trofimov (KT)- 
estimators |2|, |21| with scaling ifTTl . 

Ensemble model: For testing the ensemble approach we 
introduce a simple ensemble compression algorithm intended 
as a second step algorithm in BWT based compression. 
BWT sorts the characters in its input by context, hence it 
groups similar contexts together 11221 . Since these contexts are 
often succeeded by the same characters BWT output mostly 
consists of long interleaved runs of characters, see Fig. [2] 
Such sequences can be modelled as non-stationary 1131 . We 
model the BWT output as the outcome of a switching source, 
which consists of two individual non-stationary sources. One 
source randomly emits characters independent of the previous 
sequence (order-0), this is intended to model interruptions in a 
single characters run. A second source emits characters based 
on the character immediately preceding the current position 
(order- 1). In contrast to the first source it is intended to model 
the long runs of identical characters. The individual models are 
implemented using the binary predictors described in Section 
[II] We assume the switching probabilities to be constant, i.e., 



= 1 



[0,1]. 



Each individual model presented in Section II-B has two 
parameters A and e. The previously described BWT postpro- 
cessor has five parameters, Ai, £i, A2, £2 and w, respectively. 
Following these observations the next section will provide a 
way of optimizing the parameters. 

III. Optimization 

A. Iterative numeric optimization 

We decompose a model into its structure and parameters. 
Improving the model structure is a task which is typically 
carried out by humans. Model parameters can be fitted au- 
tomatically to a typical training data set. There are dif- 
ferent approaches, depending on the optimization target. A 
differentiable optimization target allows the usage of local 
search procedures, for instance Newton's Method, see standard 



xo i— initial estimation 

k <- 

repeat 

compute a search direction dfe along which / decreases 
perform a line search a*, arg min a / (x*, + adk) 
update the solution x^+i <— x^ + a^dfe 
next step k <— k + 1 
until stopping condition met 

Fig. 3. Basic outline of an iterative numeric minimization algorithm. 



literature on these well-known techniques, e.g., [23]. When no 
derivative information is available (i.e., a non-differentiable 
optimization target) or the search space is highly multimodal 
other stochastic search techniques should be preferred, see e.g., 
[24|. In our setting we want to minimize the average code 
length /, depending on the parameters x of the prediction 
model 

min /(x), (18) 

X < X < X 

where /(x) is given by a modification of <[3j> 



/(x) = -- 



\ (Vk logPfe(x) + (1 - y k ) log(l - Pfc(x))) 



fe=i 



(19) 

Here boldface symbols indicate matrices or vectors. The pa- 
rameter search should take place within the hypercube formed 
by the inequality constraints x € [x, x] C R . In this work 
we want to focus on derivative-based optimization techniques 
based on Quadratic Programming, since / is differentiable. 
Figure [3] shows the typical outline of such an optimization pro- 
cedure. The models described in the previous Section span a 
low-dimensional search space, e.g., x = (Ai,£i, A2,£2, w) T € 
M 5 . Opposed to the small number of parameters a function 
evaluation is, depending on the amount of training data, time 
consuming. It requires to run the corresponding model along 
with the calculation of derivatives. Since we want to use an 
online-optimization approach, the "training data" is the data to 
be actually compressed, i.e., we know it prior to optimization. 

B. Estimating the search direction 

Consider a quadratic approximation /(xfe+dfe) of the target 
function / as a result of the Taylor-expansion at Xfc 

/(x fc + d fc ) « /(x fc ) + V/(x fe ) T d fc + id^V 2 /(x fc )d fc . (20) 

Differentiating ( |20| ) in dfe and solving for its roots yields a 
search direction 



dfc = -V 2 /(x fc )^ 1 V/(x fe ). 



(21) 



The matrix Sfe can either be estimated iteratively or computed 
directly. Given a valid point Xfe € [x, x] a step towards 
dfe might lead to a violation of the constraints. Hence the 
constraints influence the computation of dfe. In order to 
calculate a feasible direction we adopt a slight modification 
of the method in ||25| . which we will now summarize briefly. 



First the index set of binding constraints 

h = 7[.(-V/(xfe)) U 4(-S'feV/(xfe)) (22) 

1 ^ / \ ^ / 

'k 1 k 

is identified depending on 

r k (S) = {i | 4 = 4 a Si < o v 4 = 4 a s t > 0} . (23) 



An element of S J, is given by 



4 j Mai 

, otherwise 



(24) 



depending on the elements s 1 ^ of Sfe. With Si we denote the i- 
th component of S, the same holds for x. t and Xi, respectively. 
The set Ik (5) contains the indices of blocked directions, i.e., 
Xi is located on a constraint boundary and Si points towards 
the constraint. Constraints contained in lfc block movements 
along the directions fulfilling the Karush-Kuhn-Tucker (KKT) 
conditions and l\ blocks movements, which would leave the 
feasible region due to the linear transform described by Sfe. 
Finally given Ik the search direction is obtained via 



-SfeV/(xfe) 



and 







,i,3 $■ h 
, otherwise 



(25) 



(26) 



During the optimization of the parameters of a single model, 
we compute the gradient V/(x) and Sfe = — V 2 /(x) _1 
directly. When carrying out the experiments for the ensemble 
model this turned out to be too expensive computationally to 
be practical for our purposes. Instead of computing Sfe we used 
the Broyden-Fletcher-Goldfab-Shanno (BFGS) approximation 
in conjunction with the Sherman-Morrison formula [23 1, ||25ll 
resulting in a Quasi-Newton step. 

C. Line search 

According to Fig. [3] an estimation of the step length is the 
next step in the optimization procedure. A step along dfe can 
still leave the feasible region, when stepping too far. There is 
an upper limit Hk of a imposed by the non-binding constraints 



Q: A . 



{1}U{^ 

,4<o 
,4>o 



h} 



(27) 
(28) 



In the case of an approximation of Sfe. the line search was 
carried out using quadratic interpolation. The derivative infor- 
mation of 

cf> k (a) = /(x fe + adfe) (29) 

is already available at a — 0. Due to the calculation of V/(xfc) 
and dfe we get 

0fe(O)=d^V/(xfe). (30) 



Now a value ji e (0,a k ] fulfilling 4> k (f3) > <f> k (0) is located. 
The minimum of the interpolation polynomial is given by 



Single model: First the optimization of a single model is 
examined, i.e., x = (A, e) T . We can restate <jSj as 



7 



<^(0)/? 2 



2^(0)/3-(<M/3)-<M0))' 
If (f>k is decreased sufficiently, i.e., 

Mi) < MO) + c^' k (o), 



(31) 



(32) 



where 



Pib(x) =e + (l-2e)g fc (A) 



9fc+i(A) = g fe + = 1 — (yk - 3k), 



where c = 10 5 , we set afc = 7 and the line search is finished. 
Otherwise j3 is replaced with 7 and the process is repeated. 

D. Stopping condition 

The optimization algorithm stops, when all components 
V;/(xfc), i £ Ik are in the range [— T, T], It turned out that the 
precision requirements are rather relaxed, T € [10~ 3 ,1CP 2 ] 
gives satisfying results. A higher request in precision translates 
into compression gains typically below 0.0001 bpc, which can 
be considered insignificant. The number of iterations has been 
limited to 50. 

E. Derivatives 

To perform the optimization process it is necessary to 
calculate the partial derivatives, since these form the gradient 
and the Hessian. For reasons of convenience we introduce 



h(y,p) = -y\np- (1 - y) ln(l -p). 

Note that here In denotes the natural logarithm, 
convention ( fT9| ) becomes 

1 ™ 

/( x ) = —r^^2 h (yk,Pk{x-)) 



in the case of Mi or 

t?fc+i(A) =q k + (l- X)(Vk - Qk)> 



(39) 



(40) 



(41) 



for M2, cf. ( fT3] i and (15) . Thus the required partial derivatives 
of pk can be expressed as 



dpk 

dX 

dpk 
de 
d 2 p k 
dX 2 
d 2 p k d 2 p k 



= (l-2 £ ) 
= 1 - 2q k , 
= (l-2e) 



dqk 

dX 



d 2 q k 
dX 2 ' 



,dqk 



(42) 
(43) 
(44) 
(45) 



dedX dXde dX ' 

Depending on the choice of the model, see Section II-B the 



term q k remains a function of A ( [13) , ( fl"5j ). Utilizing the 
iterative nature of ( fT3] l the expressions for the exact model 
(Mi) are given by 



(33) 


dqk+i 


dqk 




dX 


~~ dX 


Using this 


d 2 q k+ i 


d 2 q k 




dX 2 


dX 2 






1 


(34) 




Tk+i 



, dT k+1 

J ox 



(46) 



&Qk- 



d 2 T t 



fe+i 



dX 2 



Aq k 



d 2 q k 



dX 2 



fc=i 



and a partial derivative w.r.t. Xi, a component of x, is 

df( x ) _ 1 dh(y k ,Pk(x)) 
nln2 ^ 



dxi 



k=l 



dxi 



(35) 



Since y £ {0, 1} we may write 

d n h(y,p) 



dp n 

The first derivative 



(n-1)! 



V 
P 



i-y 



i-p 



dh{y,p) = dh(y,p) dp 
dxi dp dxi 

and the second derivative 

d 2 h(y,p) dh(y,p) dh(y,p) dh(y,p) d 2 p 



dxidxj 



dxi 



dxj 



dp dxidxj 



(36) 



(37) 



(38) 



with the abbreviations 

Aq k 

Mk 
dT k+1 

dX 
d 2 T k+1 

dX 2 



1 

Tk+i 

1 dT k+1 
T k+1 dx 
dT k 



{yk - 



A 



= A 



dX 



dX 2 



,dTk 
' dX 



(47) 

(48) 
(49) 
(50) 
(51) 



Exponential smoothing (M2), ( fl"5j ), yields the following ex- 
pressions: 



dq k +i ,dq k , > 
~dX~ = ~d~X ~ ~ 



d_qk+l 

dx 2 



X 



d qk , n dq k 



dX 2 



dX 



(52) 
(53) 



can easily be obtained. 



For the initial step, k = 0, all derivatives have been initialized 
to be zero. 



TABLE I 

Compression rates (Calgary Corpus) in bpc and average 
context history length l of different order context histories 
for the Laplace, Krichevsky-Trofimov an d the developed Mi 
and M 2 estimators (Section|II-B|i. 



Order 


L 


LP 


KT 


Mi 


M 2 





14793.1 


4.749 


4.731 


4.717 


4.719 


1 


328.5 


3.581 


3.525 


3.528 


3.554 


2 


37.4 


3.252 


3.115 


3.075 


3.222 


4 


5.3 


3.965 


3.671 


3.442 


3.620 


8 


2.2 


5.651 


5.371 


5.013 


5.101 




Fig. 4. Entropy H(\,e) of the order-2 context histories of bib. 



Ensemble model: As stated in Section[n]an ensemble model 
consists of an order-0 and an order- 1 non-stationary model 
(predicting p\ and p\) and a switching probability, or weight 
cj. Thus a point in parameter space is x = (Ai, £i, A2, £2, w) T . 
The expressions for calculating the gradient worked out above 
just need to be modified slightly. Higher order partial deriva- 
tives are estimated using BFGS. The partial derivatives of ( |33j ) 
are given by 



dh(y,p k ) _ dh(y,p k ) dp\ 
dzi dp k d Zi 



(54) 



where Zi <E {Ai, Ex, A 2 , £2}* Pk is the mixed prediction in step 
k, see ( fT7) i, and 




(55) 



Finally the remaining derivative for the ensemble model is 

(56) 

IV. Experimental evaluation 



dh(y,p k ) _ dh(y,p k ) 2 , 



A. Single model 

To evaluate a single prediction model we compare its 
compression performance against the well-known LP and 
KT estimators when forecasting conditional probabilities for 
different order- N context models. Note that N terms the 
number of source symbols, bytes in this case. The flat alphabet 
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Fig. 5. Parameter values (A,e) and a third-order polynomial fit for Mi, 
j!3) , as a function of average context history length L. 



competing models are given by 

Sk + a 



decomposition described in Section II-C has been applied. The 



Pk = " , , (57) 
lk + 2a 

where Sk is the frequency of a one bit, Tk the total number 
of encountered bits in step k and a distinguishes the LP- 
(a — 1) and KT-estimator (a — 0.5) Whenever Tk reaches 
a threshold T the frequencies and Sk are halved (scaling). 
The parameter T G {1, 2, ... , 1024} U {00} was optimized for 
each file. Table [I] summarizes the average compression rates 
per context model for the Calgary Corpus and the average 
context history length. The estimator M\ outperforms all other 
predictors, except in the case of an order- 1 context, where 
its performance is slightly worse than a KT estimator. The 
LP estimator gives the worst overall results, probably due to 
the uniform prior-distribution assumed J2), ETI . Especially for 
short context histories [12], orders 2, 4 and 8, Mi improves 
compression compared to the competing models. Exponential 
smoothing, M2, yields a good approximation when the context 
history contains at least a few hundred observations (order-0 
and order- 1). 

Figure |4] depicts the typical shape of the cost function, (19) , 
for Mi. When A is nearly zero the influence of past obser- 
vations vanishes resulting in an unstable prediction behaviour, 
thus bad compression. A reasonable value of A close to 1 gives 
good results. In the case of e ~ 0.5 virtually no compression 
takes place, since the probability estimates fall within a narrow 
band around 0.5. A small value of e near zero is a good choice. 
The estimator M2 shows similar characteristics. 

Figures [5] and [6] show the optimized values of A and e as 
a function of the context history length L. Shorter context 
histories seem to imply bigger values of e on average. This 
resembles the observation made in [ 12 1, where it is stated that a 
bounded probability interval can show significant compression 
improvements for short sequences. The relation is more pro- 
nounced in the case of Ml, see Fig. [5] The parameter A grows 
as L decreases, again this effect is sharper when observing 
M2 (Fig. |6j. A possible explanation is the variable adjustment 
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Fig. 6. Parameter values (A, e) and a third-order polynomial fit for M2, 
\i5\ , as a function of average context history length L. 

TABLE II 

Compression rates (Calgary Corpus), number of cost 
function ev aluat ions #/j and gradient evaluations #v/ j for 

Mi (SECTI0NS |II-B^ USED IN AN ENSEMBLE MODEL, SEE SECTIOn [II-D| 



File 


fi [bpc] 


#fi 


#Vfi 


f 2 [bpc] 


#fa 


#Vf 2 


bib 


1.945 


16 


10 


1.959 


5 


3 


bookl 


2.248 


11 


8 


2.255 


9 


7 


book2 


1.955 


9 


6 


1.962 


4 


3 


geo 


4.197 


20 


10 


4.199 


17 


13 


news 


2.429 


8 


5 


2.433 


9 


6 


objl 


3.801 


14 


8 


3.752 


14 


7 


objl 


2.435 


8 


4 


2.431 


6 


3 


paperl 


2.449 


8 


4 


2.466 


6 


3 


paper2 


2.368 


8 


5 


2.384 


7 


4 


pic 


0.712 


35 


19 


0.727 


14 


11 


progc 


2.472 


7 


4 


2.477 


7 


5 


progl 


1.703 


9 


7 


1.721 


8 


7 


progp 


1.721 


10 


8 


1.736 


10 


8 


trans 


1.521 


11 


9 


1.528 


11 


9 


Average 


2.283 


12.4 


7.6 


2.288 


9.1 


6.4 



proportional to the prediction error. In Mi the "adaption rate" 
1/Tfc+i ( fT3| l is large initially and decreases, opposed to M 2 
where it is constant. Short sequences require a more rapid 
adaption, thus a constant "adaption rate" 1 — A ( fT5| should be 
high. We believe that the strong dependence of the parameters 
on L in the case of very short sequence (small L) is triggered 
by the small amount of observations rather than the actual 
statistical properties of a context history. 

B. Ensemble model 

In this section we simulate and compare the simple post- 
13 WT-stage algorithm described at the end of Section [TTJ Here 
we want to focus on the use of optimization in an online- 
scenario. After the BWT-output has been generated the model 
is optimized using different initial estimations 



x(M) 



(0.67, 0.002, 0.91, 0.005, 0.44) T , M = M\ 
(0.72, 0.003, 0.96, 0.004, 0.44) T , M = M 2 



(58) 



the results. The estimator Ml shows slightly better com- 
pression than M2 on average, but requires more evaluations 
of the cost function and the gradient for optimization. A 
function evaluation directly corresponds to a compression pass, 
a gradient evaluation is slower, since more calculations are 
required. Oddly, the approximation M 2 outperforms Mi on 
objl and obj2. This indicates that the developed model does 
not fit the data characteristics in this particular case. The files 
geo and pic require many more iterations than the rest of the 
data, the optimal values of x differ significantly from the rest 
of the corpus (and from the initial estimations), e.g., 



(M) 



(0.922, 10" 
(0.931, 10" 



0.997, 0.002, 0.412) 3 
0.951, 0.004, 0.297) 3 



,M 
,M 



Mi 
M 2 ' 
(59) 

From a practical point of view M 2 achieves good compression, 
while requiring less resources during compression and offering 



For decompression the optimized parameters need to be trans- 
mitted along with the compressed data. Table [II] summarizes 



faster model optimization. Finally Tab. Ill compares our best 
results to 

• BW94 E2l - the classical result of Burrows and Wheeler 
using Move-to-Front (MTF) and Huffman-Coding, 

• BS99 ll26l - modified MTF and statistical modeling 
coupled with AC, 

. WM01 1 13 1 - parsing and encoding (via AC) the BWT- 
output and omitting second-stage transformations and 

. D02 E2 - Weighted Frequency Counting (WFC), a 
different post-BWT transform, in conjunction wiht AC. 
All of the other algorithms are rather complex, since they 
include either special post BWT transforms (e.g., MTF or 
WFC) and a statistical model or a sophisticated statistical 
model with a special parsing of the BWT output. Our approach 
is very simple and straight forward, since it just consists of 
a simple statistical model which processes the BWT output 
symbol by symbol (without a special parsing strategy). Taking 
the simplicity of our algorithm as a base it performs very well 
among the other approaches. The ensemble model gains 5% 
over BW94 and 1.3% over WM01. However, it compresses 
circa 1% worse than BS99 and 1.5% worse than D02. In the 
case of bookl, book2 and pic the ensemble model outperforms 
the other algorithms, showing the benefit of an optimized non- 
stationary model. The main drawback of the approach is that 
optimization is time consuming, since the online optimization 
requires to compress its input between 9 (M 2 ) and 12 (Mi) 
times on average (see Tab. [II]). But this can be neglected in 
a "distribution-scenario": compression just takes place a few 
times and the compressed data is distributed, e.g., over the 
internet and needs to be decompressed multiple times. 

V. Conclusion 

In this paper a new approach to modelling non-stationary 
binary sequences was studied and possible low-complexity im- 
plementations have been shown. Using an iterative parameter- 
optimization method the parameters of the model can be fitted 
to training data automatically. In all test cases the new model 
shows a good performance compared to the LP- and KT- 
estimators. Both classic estimators are surpassed except in 



TABLE III 

Compression in bpc of various BWT-based algorithms against 
our best result. 



File 


BW94 


BS99 


WM01 


D02 


best 


bib 


2.020 


1.910 


1.951 


1.896 


1.945 


bookl 


2.480 


2.270 


2.363 


2.274 


2.248 


book2 


2.100 


1.960 


2.013 


1.958 


1.955 


geo 


4.730 


4.160 


4.354 


4.152 


4.197 


news 


2.560 


2.420 


2.465 


2.409 


2.429 


objl 


3.880 


3.730 


3.800 


3.695 


3.801 


obj2 


2.530 


2.450 


2.462 


2.414 


2.435 


paperl 


2.520 


2.410 


2.453 


2.403 


2.449 


paper2 


2.500 


2.360 


2.416 


2.347 


2.368 


pic 


0.790 


0.720 


0.768 


0.717 


0.712 


progc 


2.540 


2.450 


2.469 


2.431 


2.472 


progl 


1.750 


1.680 


1.678 


1.670 


1.703 


progp 


1.740 


1.680 


1.692 


1.672 


1.721 


trans 


1.520 


1.460 


1.484 


1.452 


1.521 


Average 


2.404 


2.261 


2.312 


2.249 


2.283 



one case, where our models show slightly worse results. Thus 
in the case of compressing non-stationary data the presented 
models typically improves compression. Beside the usage as a 
binary predictor on its own an ensemble model based on two 
non-stationary submodels for compressing BWT output has 
been designed. An alphabet decomposition is required to map 
the n-ary alphabet to a binary sequence, so the binary predictor 
can be used. The ensemble model contains an optimization 
pass prior to the actual compression. Such a simple ensemble 
model, together with online-optimization, shows good com- 
pression performance. Note that the ensemble model is very 
simple and does not apply any parsing strategies or post BWT 
transforms - it directly models symbol probabilities of plain 
BWT output. In order to make such an approach more practical 
further steps need to be taken to speed up the optimization 
process. Combining multiple models in data compression is 
highly successful in practice, but more research in this area is 
needed. 
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